RR9 Retina Dataset Integration
Digital Twin Project Description
Space biology confronts a critical obstacle: incomplete data, a consequence of the logistical complexity and high cost of space missions. To address this issue, this research presents strategies that integrate AI and digital twin technology to overcome the limitations posed by sparse datasets in space biology research.
By presenting a cohesive strategy that combines synthetic data generation, automatic labeling, and advanced machine learning with digital twins, we showcase an application to the RR9 dataset hosted in the NASA Open Science Data Repository (OSDR). This research aims to overcome the challenge of data scarcity in space biology, thereby forging a path to unlock insights into the potential of life beyond Earth.
RR9 Background
The Rodent Research 9 (RR9) payload consisted of three space biology experiments designed to examine the impacts of long-duration spaceflight on visual impairment and joint tissue degradation, both of which affect astronauts.
| Investigation | Purpose | Experiments |
|---|---|---|
| Investigation 1 | Effects of microgravity on fluid shifts and increased fluid pressures that occur in the head. | 1. To determine whether spaceflight on the ISS alters rodent basilar artery spontaneous tone, myogenic and KCl (Potassium Chloride)-evoked vasoconstriction, mechanical stiffness and gross structure. 2. To estimate whether spaceflight on the ISS alters the blood-brain barrier in rodents, as indicated by ultrastructural examination of the junctional complex of the cerebral capillary endothelium. 3. To determine whether spaceflight on the ISS alters rodent basal vein (inside cranium) and jugular vein (outside cranium) spontaneous tone, myogenic and KCl-evoked constriction, distension, and gross structure. 4. To determine whether spaceflight on the ISS alters the ability of the cervical lymphatics to modulate lymph flow, and thus, regulate cerebral fluid homeostasis. |
| Investigation 2 | Impact of spaceflight on the vessels that supply blood to the eyes. | 1. Define the relationships between spaceflight condition-induced oxidative stress in reactive oxygen species (ROS) expression and retinal vascular remodeling and blood-retinal barrier (BRB) function in mice returned to Earth alive. 2. Determine whether spaceflight condition-induced oxidative damage in the retina is mediated through photoreceptor mitochondrial ROS production. |
| Investigation 3 | Extent of knee and hip joint degradation caused by prolonged exposure to weightlessness. | 1. Determine the extent of knee and hip joint degradation in mice after 30 days of spaceflight on the ISS. 2. Use the DigiGait System to assess gait patterns before and after returning from the ISS. |
We are interested in the retinal data and everything related to the eye, so all experiments from Investigations 1 and 2 are studied here. Below is a table of all OSD identifiers related to these investigations, obtained from https://osdr.nasa.gov/bio/repo/data/payloads/RR-9
| Identifier | Title | Factors | Assay Types |
|---|---|---|---|
| OSD-557 | Spaceflight influences gene expression, photoreceptor integrity, and oxidative stress related damage in the murine retina (RR-9) | Spaceflight | Bone Microstructure, Molecular Cellular Imaging, histology |
| OSD-568 | Characterization of mouse ocular responses (Microscopy) to a 35-day (RR-9) spaceflight mission: Evidence of blood-retinal barrier disruption and ocular adaptations | Spaceflight | Molecular Cellular Imaging |
| OSD-715 | Characterization of mouse ocular response to a 35-day spaceflight mission: Evidence of blood-retinal barrier disruption and ocular adaptations - Proteomics data | Spaceflight | protein expression profiling |
| OSD-255 | Spaceflight influences gene expression, photoreceptor integrity, and oxidative stress-related damage in the murine retina | Spaceflight | transcription profiling |
| OSD-140 | Space Flight Environment Induces Remodeling of Vascular Network and Glia-Vascular Communication in Mouse Retina | Spaceflight | |
| OSD-583 | Characterization of mouse ocular responses (intraocular pressure) to a 35-day (RR-9) spaceflight mission: Evidence of blood-retinal barrier disruption and ocular adaptations | Spaceflight | Tonometry |
The purpose of this notebook is to combine all retina data from the Rodent Research 9 (RR9) mission from the NASA Open Science Data Repository, perform exploratory data analysis, impute missing data and train a digital twin.
Original Author: Lauren Sanders
Additional Author(s): Jian Gong, Vaishnavi Nagesh
Load Data
We are downloading all of the relevant data as shown in the table below.
Data Exploration and Validation
The table below shows the number of rows and features in each dataset constituting the RR9 multi-modal data. This is useful for identifying the maximum number of PCA components required to explain the cumulative variance in each dataset.
| Data | Rows X Features |
|---|---|
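As a reminder of the constraint referenced above, the number of PCA components cannot exceed min(n_samples, n_features). A minimal sketch on synthetic data (the shape mirrors the flight subset's 20 x 82 matrix seen later; the variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 82))  # toy stand-in: 20 samples x 82 features

# n_components is bounded by min(n_samples, n_features) = 20 here
pca = PCA(n_components=10)
pca.fit(X)

# The cumulative explained variance shows how many components are needed
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(cumvar[-1] <= 1.0)  # True
```

Plotting `cumvar` against the component index is the usual way to pick the smallest number of components that explains, say, 90% of the variance.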
Summary of the Merged DataFrame
PCA on Different Categories of Datasets
PCA on RNASeq
PCA on RNASeq for Only Predictive Genes for Phenotype
Code
genes_predictive_of_phenotypes = [
'ENSMUSG00000021185',
'ENSMUSG00000021432',
'ENSMUSG00000021712',
'ENSMUSG00000023484',
'ENSMUSG00000025484',
'ENSMUSG00000026768',
'ENSMUSG00000028184',
'ENSMUSG00000028423',
'ENSMUSG00000029499',
'ENSMUSG00000036636',
'ENSMUSG00000039994',
'ENSMUSG00000041685',
'ENSMUSG00000042190',
'ENSMUSG00000045318',
'ENSMUSG00000050538',
'ENSMUSG00000052373',
'ENSMUSG00000068250',
'ENSMUSG00000068394',
'ENSMUSG00000070822',
'ENSMUSG00000073879',
'ENSMUSG00000084408',
'ENSMUSG00000097061',
'ENSMUSG00000097180',
'ENSMUSG00000106147',
'ENSMUSG00000107195',
'ENSMUSG00000110357',
]
set(genes_predictive_of_phenotypes).issubset(rnaseq.columns.to_list())
# Filter the RNASeq dataset to include only the genes predictive of phenotypes
filtered_rnaseq = rnaseq[genes_predictive_of_phenotypes]
# Perform PCA on the filtered dataset
phenotype_rnaseq_pca_df, scatter_plt = perform_pca(rr9_all_df, dataset_name=filtered_rnaseq)
scatter_plt.show()
PCA on Proteomics
PCA on TUNEL Assay
PCA on HNE Immunostaining Microscopy
PCA on Micro CT
PCA on Combined Immunostaining Microscopy Data from Zo-1, PECAM, PNA and HNE
Zo-1, PECAM, PNA and HNE are all immunostaining microscopy datasets. It would be useful to see whether combining them improves the separation between the groups.
From the PCA analysis of the different datasets, there appears to be fair separation between the flight group and the other groups. However, there isn't sufficient data to show separation among the GC, Viv and CC groups.
Data Analysis Correlation Between HNE and RNASeq
HNE immunostaining microscopy has data across four groups (F, GC, Viv, CC2), while RNASeq is available for two groups (F and GC). Among these, samples F15-F20 and GC15-GC20 have data for both HNE and RNASeq.
To anchor the imputation of RNASeq data to biological characteristics, the correlation between RNASeq and HNE needs to be determined.
Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def analyze_correlation(dataset_df, gene_list, rr9_all_df):
    """
    Analyze the correlation between a given dataset and a list of genes.
    Parameters:
    - dataset_df: DataFrame, the dataset to analyze (e.g., TUNEL, HNE, etc.)
    - gene_list: list, the list of genes to analyze
    - rr9_all_df: DataFrame, the merged RR9 dataset containing all data
    Returns:
    - None, displays one correlation heatmap per group
    """
    # Select relevant columns
    rna_cols_to_select = gene_list + ['Source Name', 'Group']
    dataset_cols_to_select = dataset_df.columns.tolist() + ['Source Name', 'Group']
    # Filter and drop missing values
    rnaseq_filtered = rr9_all_df[rna_cols_to_select].dropna(how='any').reset_index(drop=True)
    dataset_filtered = rr9_all_df[dataset_cols_to_select].dropna(how='any').reset_index(drop=True)
    # Keep only the samples present in both modalities
    combined_df = pd.merge(dataset_filtered, rnaseq_filtered, on=['Source Name', 'Group'], how='inner')
    # Iterate over each group and compute the correlation matrix
    for group in combined_df['Group'].unique():
        # Filter the merged frame for the current group; subsetting the merged
        # frame directly avoids mismatched row indices after the merge
        group_df = combined_df[combined_df['Group'] == group]
        # Compute the correlation matrix over numeric columns only
        correlation_matrix = group_df.corr(numeric_only=True)
        # Plot the heatmap in its own figure so groups don't overlap
        plt.figure()
        sns.heatmap(
            correlation_matrix,
            cmap='coolwarm',
            annot=False,
            cbar_kws={'label': 'Correlation Coefficient'}
        )
        plt.title(f"Correlation Matrix for Group: {group}")
        plt.show()
analyze_correlation(hne, found_genes, rr9_all_df=rr9_all_df)
analyze_correlation(hne, genes_predictive_of_phenotypes, rr9_all_df=rr9_all_df)
Data Analysis Correlation Between TUNEL Assay and RNASeq
TUNEL was selected for correlation analysis because the TUNEL assay points separate more cleanly between the different groups on the PCA plots than the HNE data points do. The TUNEL assay has data across four groups (F, GC, Viv, CC2), while RNASeq is available for two groups (F and GC). Among these, samples F15-F20 and GC15-GC20 have data for both TUNEL and RNASeq.
To anchor the imputation of RNASeq data to biological characteristics, the correlation between RNASeq and TUNEL needs to be determined.
Code
analyze_correlation(tunel, found_genes, rr9_all_df=rr9_all_df)
analyze_correlation(tunel, genes_predictive_of_phenotypes, rr9_all_df=rr9_all_df)
Imputation of Relevant Genes from TUNEL Data
The first step is to see how many genes from the gene list of interest lack data. Imputation will be done in two groups: F (flight) and non-F (non-flight). Samples F9 and F11 have RNASeq values but not TUNEL assay values.
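A quick way to carry out that first check is to compare fully missing genes against partially missing ones. A minimal sketch on a toy stand-in frame (the column names reuse IDs from the gene list above, but the values are synthetic):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged RR9 frame
df = pd.DataFrame({
    'ENSMUSG00000021185': [1.0, np.nan, 2.0],
    'ENSMUSG00000021432': [np.nan, np.nan, np.nan],
    'Group': ['F', 'F', 'GC'],
})
gene_list = ['ENSMUSG00000021185', 'ENSMUSG00000021432']

# Genes with no measurements at all vs. genes only partially missing
fully_missing = [g for g in gene_list if df[g].isna().all()]
missing_frac = df[gene_list].isna().mean()
print(fully_missing)  # ['ENSMUSG00000021432']
```

Genes in `fully_missing` cannot be imputed from their own column and need to be predicted from other features (or dropped); `missing_frac` helps choose an imputer, since some methods behave poorly above roughly 25-30% missingness.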
Code
flight_relevant_genes = rr9_all_df[rr9_all_df['Group'] == 'F'][found_genes+genes_predictive_of_phenotypes + ['Source Name', 'Group']]
non_flight_relevant_genes = rr9_all_df[rr9_all_df['Group'] != 'F'][found_genes+genes_predictive_of_phenotypes + ['Source Name', 'Group']]
flight_tunel_data = rr9_all_df[rr9_all_df['Group'] == 'F'][tunel.columns.to_list() + ['Source Name', 'Group']]
non_flight_tunel_data = rr9_all_df[rr9_all_df['Group'] != 'F'][tunel.columns.to_list() + ['Source Name', 'Group']]
merged_flight_data = pd.merge(
flight_relevant_genes,
flight_tunel_data,
on=['Source Name', 'Group'],
how='outer'
)
merged_non_flight_data = pd.merge(
non_flight_relevant_genes,
non_flight_tunel_data,
on=['Source Name', 'Group'],
how='outer'
)
Establishing a Classifier to Validate Imputation for TUNEL and RNASeq Datasets
To validate the effectiveness of imputation, we propose building a binary classifier to distinguish flight from non-flight samples using the complete RNASeq and TUNEL datasets. After imputing missing values in the TUNEL and RNASeq datasets, we will train the same classifier on the imputed data and compare its performance metrics (e.g., accuracy, precision, recall, F1-score) with those obtained from the complete datasets. This comparison will help assess whether the imputation process preserves the integrity and predictive power of the data.
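The validation idea can be sketched as follows. This is a toy illustration, not the notebook's implementation: the random-forest classifier, knockout fraction, and F1 metric are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy feature matrix with a binary flight / non-flight label
X_complete = pd.DataFrame(rng.normal(size=(100, 10)))
y = (X_complete[0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Knock out ~20% of values, then impute them back
X_missing = X_complete.mask(rng.random(X_complete.shape) < 0.2)
X_imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X_missing))

# Train the same classifier on complete vs. imputed data and compare scores
scores = {}
for name, X in [('complete', X_complete), ('imputed', X_imputed)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te))

print(scores)
```

If the two scores are close, the imputation has preserved the predictive signal; a large drop on the imputed data flags an imputer that distorts the feature structure.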
KNN Imputer
Flight Data
Code
imp_knn5 = KNNImputer(n_neighbors=5, weights='distance')
imp_df_knn5 = imp_knn5.fit_transform(merged_flight_data.drop(columns=['Source Name', 'Group']))
imp_df_knn5 = pd.DataFrame(imp_df_knn5, columns=merged_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
imp_df_knn5['Source Name'] = merged_flight_data['Source Name']
imp_df_knn5['Group'] = merged_flight_data['Group']
imp_knn2 = KNNImputer(n_neighbors=2, weights='distance')
imp_df_knn2 = imp_knn2.fit_transform(merged_flight_data.drop(columns=['Source Name', 'Group']))
imp_df_knn2 = pd.DataFrame(imp_df_knn2, columns=merged_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
imp_df_knn2['Source Name'] = merged_flight_data['Source Name']
imp_df_knn2['Group'] = merged_flight_data['Group']
fig = px.imshow(imp_df_knn5.corr(numeric_only=True), text_auto=True)
fig.show()
fig = px.imshow(imp_df_knn2.corr(numeric_only=True), text_auto=True)
fig.show()
corr_knn_2_matrix = imp_df_knn2.corr(numeric_only=True)
corr_knn_2_df = corr_knn_2_matrix.unstack().reset_index()
corr_knn_2_df.rename(columns={'level_0': 'para_1', 'level_1': 'para_2',
                              0: 'corr_coef_knn'}, inplace=True)
/Users/vaishnavinagesh/Desktop/AI-ML_AWG/.venv/lib/python3.12/site-packages/sklearn/utils/extmath.py:205: RuntimeWarning:
divide by zero encountered in matmul
(Repeated overflow and invalid-value RuntimeWarnings from the same line omitted.)
Non-Flight Data
Code
imp_knn5 = KNNImputer(n_neighbors=5, weights='distance')
imp_df_knn5 = imp_knn5.fit_transform(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
imp_df_knn5 = pd.DataFrame(imp_df_knn5, columns=merged_non_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
imp_df_knn5['Source Name'] = merged_non_flight_data['Source Name']
imp_df_knn5['Group'] = merged_non_flight_data['Group']
imp_knn2 = KNNImputer(n_neighbors=2, weights='distance')
imp_df_knn2 = imp_knn2.fit_transform(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
imp_df_knn2 = pd.DataFrame(imp_df_knn2, columns=merged_non_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
imp_df_knn2['Source Name'] = merged_non_flight_data['Source Name']
imp_df_knn2['Group'] = merged_non_flight_data['Group']
fig = px.imshow(imp_df_knn5.corr(numeric_only=True), text_auto=True)
fig.show()
fig = px.imshow(imp_df_knn2.corr(numeric_only=True), text_auto=True)
fig.show()
/Users/vaishnavinagesh/Desktop/AI-ML_AWG/.venv/lib/python3.12/site-packages/sklearn/utils/extmath.py:205: RuntimeWarning:
divide by zero encountered in matmul
(Repeated overflow and invalid-value RuntimeWarnings from the same line omitted.)
Random Sample Imputer
This imputer is used in cases where more than roughly 25-30% of the data must be imputed, and it is also fast compared to the other methods.
Flight Data
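Before the RR9-specific cells, here is a manual sketch of what random-sample imputation does, on synthetic data: each missing value is replaced by a random draw from the observed values of the same column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=50))
s[s.sample(frac=0.3, random_state=0).index] = np.nan  # knock out 30% of values

# Replace each NaN with a draw (with replacement) from the observed values,
# which preserves the column's marginal distribution
observed = s.dropna().to_numpy()
filled = s.copy()
filled[s.isna()] = rng.choice(observed, size=s.isna().sum())

print(int(filled.isna().sum()))  # 0
```

This mirrors the behavior of the `RandomSampleImputer` used in the cells below; because the draws are independent per column, it preserves marginal distributions but not cross-feature correlations.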
Code
rsi = RandomSampleImputer()
rsi_df = rsi.fit_transform(merged_flight_data.drop(columns=['Source Name', 'Group']))
rsi_df = pd.DataFrame(rsi_df, columns=merged_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
rsi_df['Source Name'] = merged_flight_data['Source Name']
rsi_df['Group'] = merged_flight_data['Group']
fig = px.imshow(rsi_df.corr(numeric_only=True), text_auto=True)
fig.show()
corr_rsi_matrix = rsi_df.corr(numeric_only=True)
corr_rsi_df = corr_rsi_matrix.unstack().reset_index()
corr_rsi_df.rename(columns={'level_0': 'para_1', 'level_1': 'para_2',
                            0: 'corr_coef_rsi'}, inplace=True)
Non-Flight Data
Code
rsi = RandomSampleImputer()
rsi_df = rsi.fit_transform(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
rsi_df = pd.DataFrame(rsi_df, columns=merged_non_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
rsi_df['Source Name'] = merged_non_flight_data['Source Name']
rsi_df['Group'] = merged_non_flight_data['Group']
fig = px.imshow(rsi_df.corr(numeric_only=True), text_auto=True)
fig.show()
Multiple Imputation by Chained Equations
One can impute missing values by predicting them using other features from the dataset.
MICE, or ‘Multiple Imputation by Chained Equations’ (also known as ‘Fully Conditional Specification’), is a popular approach to do this.
Here is a quick intuition (not the exact algorithm):
Take the variable containing missing values as the response ‘Y’ and the other variables as predictors ‘X’.
Build a model using the rows where Y is not missing.
Predict the missing observations.
Repeat multiple times with random draws of the data and take the mean of the predictions.
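The steps above can be sketched with scikit-learn's `IterativeImputer`, which the next cells use. Toy data with two correlated columns, default `BayesianRidge` estimator; the variable names are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
# Two correlated columns; the second has missing entries
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])
X[::10, 1] = np.nan

# Each missing value is modeled as a regression on the other columns,
# iterating until the imputed values stabilize
imp = IterativeImputer(random_state=0, max_iter=10)
X_filled = imp.fit_transform(X)

print(np.isnan(X_filled).sum())  # 0
```

Because column 1 is nearly `2 * x`, the regression-based fill lands close to the true values; this is the same mechanism the cells below apply with `HistGradientBoostingRegressor` and `BaggingRegressor` as the estimator.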
Flight Data
Code
lgbr = HistGradientBoostingRegressor(random_state=2)
itera_imp = IterativeImputer(random_state=2, initial_strategy='median', estimator=lgbr, max_iter=10, verbose=2)
itera_imp.fit(merged_flight_data.drop(columns=['Source Name', 'Group']))
df_imputed = itera_imp.transform(merged_flight_data.drop(columns=['Source Name', 'Group']))
df_imputed = pd.DataFrame(df_imputed, columns=merged_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
df_imputed['Source Name'] = merged_flight_data['Source Name']
df_imputed['Group'] = merged_flight_data['Group']
corr_matrix = df_imputed.drop(columns=['Source Name','Group']).corr()
# Plot the heatmap
# fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()
corr_mice_df = corr_matrix.unstack().reset_index()
corr_mice_df.rename(columns={'level_0': 'para_1', 'level_1': 'para_2',
                             0: 'corr_coef_mice_boost'}, inplace=True)
[IterativeImputer] Completing matrix with shape (20, 82)
[IterativeImputer] Ending imputation round 1/10, elapsed time 1.60
[IterativeImputer] Change: 1397.3284127942497, scaled tolerance: 65.6542740629091
[IterativeImputer] Ending imputation round 2/10, elapsed time 3.10
[IterativeImputer] Change: 0.0, scaled tolerance: 65.6542740629091
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (20, 82)
[IterativeImputer] Ending imputation round 1/2, elapsed time 0.25
[IterativeImputer] Ending imputation round 2/2, elapsed time 0.50
Non-Flight Data
Code
lgbr = HistGradientBoostingRegressor(random_state=2)
itera_imp = IterativeImputer(random_state=2, initial_strategy='median', estimator=lgbr, max_iter=10, verbose=2)
itera_imp.fit(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
df_imputed = itera_imp.transform(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
df_imputed = pd.DataFrame(df_imputed, columns=merged_non_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
df_imputed['Source Name'] = merged_non_flight_data['Source Name']
df_imputed['Group'] = merged_non_flight_data['Group']
corr_matrix = df_imputed.drop(columns=['Source Name','Group']).corr()
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()
[IterativeImputer] Completing matrix with shape (80, 82)
[IterativeImputer] Ending imputation round 1/10, elapsed time 1.56
[IterativeImputer] Change: 1308.8961960633826, scaled tolerance: 57.9240767404965
[IterativeImputer] Ending imputation round 2/10, elapsed time 3.25
[IterativeImputer] Change: 0.0, scaled tolerance: 57.9240767404965
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (80, 82)
[IterativeImputer] Ending imputation round 1/2, elapsed time 0.26
[IterativeImputer] Ending imputation round 2/2, elapsed time 0.51
MICE with Bagging Regressor
Flight Data
Code
bagger = BaggingRegressor(random_state=2)
itera_bagger = IterativeImputer(random_state=2, initial_strategy='median', estimator=bagger, max_iter=50, verbose=2, tol=0.01)
itera_bagger.fit(merged_flight_data.drop(columns=['Source Name', 'Group']))
df_bag_imputed_flight = itera_bagger.transform(merged_flight_data.drop(columns=['Source Name', 'Group']))
df_bag_imputed_flight = pd.DataFrame(df_bag_imputed_flight, columns=merged_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
df_bag_imputed_flight['Source Name'] = merged_flight_data['Source Name']
df_bag_imputed_flight['Group'] = merged_flight_data['Group']
corr_matrix = df_bag_imputed_flight.drop(columns=['Source Name','Group']).corr()
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()
corr_mice_bag_df = corr_matrix.unstack().reset_index()
corr_mice_bag_df.rename(columns={'level_0': 'para_1', 'level_1': 'para_2',
                                 0: 'corr_coef_mice_bag'}, inplace=True)
[IterativeImputer] Completing matrix with shape (20, 82)
[IterativeImputer] Ending imputation round 1/50, elapsed time 0.37
[IterativeImputer] Change: 1940.5475771234073, scaled tolerance: 656.542740629091
(Rounds 2-38 omitted; the change fluctuated between roughly 577 and 3847 against the scaled tolerance of 656.54.)
[IterativeImputer] Ending imputation round 39/50, elapsed time 14.27
[IterativeImputer] Change: 576.7314548030308, scaled tolerance: 656.542740629091
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (20, 82)
[IterativeImputer] Ending imputation rounds 1-39 (transform pass), elapsed time 0.02-0.83
Non-Flight Data
Code
bagger = BaggingRegressor(random_state=2)
itera_bagger = IterativeImputer(random_state=2, initial_strategy='median', estimator=bagger, max_iter=50, verbose=2, tol=0.01)
itera_bagger.fit(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
df_bag_imputed_non_flight = itera_bagger.transform(merged_non_flight_data.drop(columns=['Source Name', 'Group']))
df_bag_imputed_non_flight = pd.DataFrame(df_bag_imputed_non_flight, columns=merged_non_flight_data.drop(columns=['Source Name', 'Group']).columns.to_list())
df_bag_imputed_non_flight['Source Name'] = merged_non_flight_data['Source Name']
df_bag_imputed_non_flight['Group'] = merged_non_flight_data['Group']
corr_matrix = df_bag_imputed_non_flight.drop(columns=['Source Name','Group']).corr()
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()
[IterativeImputer] Completing matrix with shape (80, 82)
[IterativeImputer] Ending imputation round 1/50, elapsed time 0.37
[IterativeImputer] Change: 2569.9775989559485, scaled tolerance: 579.240767404965
[IterativeImputer] Ending imputation round 2/50, elapsed time 0.74
[IterativeImputer] Change: 1738.3965212773273, scaled tolerance: 579.240767404965
[... rounds 3-25 elided; the change oscillates between roughly 786 and 1785 ...]
[IterativeImputer] Ending imputation round 26/50, elapsed time 9.69
[IterativeImputer] Change: 494.5048673581032, scaled tolerance: 579.240767404965
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (80, 82)
[IterativeImputer] Ending imputation round 1/26, elapsed time 0.03
[... rounds 2-25 elided ...]
[IterativeImputer] Ending imputation round 26/26, elapsed time 0.67
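The trace above is scikit-learn's `IterativeImputer` (MICE) running with a verbose flag. The exact configuration used for the RR9 matrices is not shown here, so the sketch below is a scaled-down stand-in (a 30×6 synthetic matrix instead of the 80×82 non-flight frame) that produces the same kind of per-round "Change / scaled tolerance" log:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingRegressor

# Small synthetic stand-in for the (80, 82) non-flight matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 6))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries

# verbose > 0 prints the "Ending imputation round ..." and
# "Change: ..., scaled tolerance: ..." lines seen in the log above
imputer = IterativeImputer(
    estimator=BaggingRegressor(n_estimators=5, random_state=42),
    max_iter=10,
    verbose=2,
    random_state=42,
)
X_imputed = imputer.fit_transform(X)
```

Early stopping fires once the change between rounds drops below the scaled tolerance, which is why the first run above terminates at round 26 of 50.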
Checking the Correlations
Flight Data
SignificanceResult(statistic=np.float64(0.986830117455662), pvalue=np.float64(0.0))
SignificanceResult(statistic=np.float64(0.6872370858719213), pvalue=np.float64(0.0))
SignificanceResult(statistic=np.float64(0.9956665262301658), pvalue=np.float64(0.0))
SignificanceResult(statistic=np.float64(0.979579624050381), pvalue=np.float64(0.0))
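The `SignificanceResult` lines above are the printed form of a `scipy.stats` rank-correlation test comparing paired feature series. A minimal sketch (synthetic data, not the RR9 values) that produces output of this shape:

```python
import numpy as np
from scipy import stats

# Two nearly identical series, standing in for observed vs. imputed values
rng = np.random.default_rng(0)
observed = rng.normal(size=100)
imputed = observed + rng.normal(scale=0.05, size=100)

# spearmanr returns a SignificanceResult(statistic=..., pvalue=...)
result = stats.spearmanr(observed, imputed)
print(result)
```

A statistic near 1.0 with a vanishing p-value, as in the flight-data results above, indicates the compared series preserve their rank ordering almost perfectly.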
SVM for Validating Imputations
Code
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
def train_svm_classifier(data, label_col='Group'):
    """
    Train an SVM classifier to distinguish between flight and non-flight samples.
    """
    # Prepare features and labels
    X = data.drop(columns=['Source Name', label_col])
    y = data[label_col].apply(lambda x: 1 if x == 'F' else 0)  # Binary classification: Flight (1) vs Non-Flight (0)
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
    # Train the SVM classifier
    svm = SVC(kernel='linear', random_state=42)
    svm.fit(X_train, y_train)
    # Evaluate the classifier
    y_pred = svm.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    return svm
Code
# Train and evaluate SVM on RNASeq staining data
print("SVM on RNA Seq Data:")
rnaseq_data = rr9_all_df[['Source Name', 'Group'] + rnaseq.columns.tolist()]
# Drop rows with NaN values
rnaseq_data_cleaned = rnaseq_data.dropna()
train_svm_classifier(rnaseq_data_cleaned)
SVM on RNA Seq Data:
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 3
1 1.00 1.00 1.00 2
accuracy 1.00 5
macro avg 1.00 1.00 1.00 5
weighted avg 1.00 1.00 1.00 5
Accuracy: 1.0
SVC(kernel='linear', random_state=42)
Code
# Train and evaluate SVM on HNE staining data
print("SVM on HNE Staining Data:")
hne_data = rr9_all_df[['Source Name', 'Group'] + hne.columns.tolist()]
# Drop rows with NaN values
hne_data_cleaned = hne_data.dropna()
train_svm_classifier(hne_data_cleaned)
SVM on HNE Staining Data:
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 5
1 1.00 1.00 1.00 2
accuracy 1.00 7
macro avg 1.00 1.00 1.00 7
weighted avg 1.00 1.00 1.00 7
Accuracy: 1.0
SVC(kernel='linear', random_state=42)
Code
# Train and evaluate SVM on TUNEL data
print("SVM on TUNEL Data:")
tunel_data = rr9_all_df[['Source Name', 'Group'] + tunel.columns.tolist()]
# Drop rows with NaN values
tunel_data_cleaned = tunel_data.dropna()
train_svm_classifier(tunel_data_cleaned)
SVM on TUNEL Data:
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 5
1 1.00 1.00 1.00 2
accuracy 1.00 7
macro avg 1.00 1.00 1.00 7
weighted avg 1.00 1.00 1.00 7
Accuracy: 1.0
SVC(kernel='linear', random_state=42)
Code
# Train and evaluate SVM on MICE with Bagging Regressor-imputed data
print("SVM on MICE with Bagging Regressor-Imputed Data:")
mice_bag_imputed_data = pd.concat([df_bag_imputed_flight, df_bag_imputed_non_flight], axis=0)
# Drop rows with NaN values in the 'Group' column
mice_bag_imputed_data = mice_bag_imputed_data.dropna(subset=['Group'])
train_svm_classifier(mice_bag_imputed_data)
SVM on MICE with Bagging Regressor-Imputed Data:
Classification Report:
precision recall f1-score support
0 0.96 0.96 0.96 24
1 0.83 0.83 0.83 6
accuracy 0.93 30
macro avg 0.90 0.90 0.90 30
weighted avg 0.93 0.93 0.93 30
Accuracy: 0.9333333333333333
SVC(kernel='linear', random_state=42)
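A single 70/30 split on 30 samples gives a noisy accuracy estimate, especially with only 6 flight samples in the test pool. One way to firm up the validation is stratified cross-validation; the sketch below uses synthetic stand-in data with the same 24-vs-6 class balance, not the actual imputed RR9 frame:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in mirroring the 24 non-flight / 6 flight class balance
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(24, 10)),
               rng.normal(1.5, 1.0, size=(6, 10))])
y = np.array([0] * 24 + [1] * 6)

# Stratified folds keep the flight/non-flight ratio in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(SVC(kernel='linear', random_state=42), X, y, cv=cv)
print(scores.mean())
```

Reporting the mean and spread across folds would show whether the 0.93 accuracy above is stable or an artifact of one favorable split.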
Use Embeddings to Impute
filename Mice Name \
0 /kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PN... Chc20
1 /kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PN... Chc20
2 /kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PN... Chc20
3 /kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PN... Chc20
4 /kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PN... Chc20
.. ... ...
178 /kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA... CC220
179 /kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA... CC220
180 /kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA... CC220
181 /kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA... CC220
182 /kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA... CC220
Staining Technique Channel
0 Ret CH1
1 Ret CH1
2 Ret CH1
3 Ret CH1
4 Ret CH1
.. ... ...
178 Ret CH1
179 Ret CH1
180 Ret CH1
181 Ret CH1
182 Ret CH1
[183 rows x 4 columns]
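A metadata frame like the one above can be assembled by parsing the image paths. The token positions below are assumptions about a `<Mouse>_RR9_<Stain>_20X_PNA...` naming pattern, not a confirmed convention; note that in the frame above the `VG20` files carry the Mice Name `CC220`, so the real mouse IDs may come from a separate lookup table rather than the filename alone:

```python
import pandas as pd
from pathlib import Path

# Hypothetical paths following the pattern seen above
files = [
    "/kaggle/working/data2/CH1/Chc20_RR9_Ret_20X_PNA_01.tif",
    "/kaggle/working/data2/CH1/VG20_RR9_Ret_20X_PNA_02.tif",
]

records = []
for f in files:
    parts = Path(f).stem.split("_")     # e.g. ['Chc20', 'RR9', 'Ret', '20X', 'PNA', '01']
    records.append({
        "filename": f,
        "Mice Name": parts[0],          # assumed: mouse ID is the first token
        "Staining Technique": parts[2], # assumed: stain is the third token
        "Channel": Path(f).parent.name, # channel taken from the folder, e.g. 'CH1'
    })

df = pd.DataFrame(records, columns=["filename", "Mice Name", "Staining Technique", "Channel"])
```

This yields one row per image with the same four columns shown above, which can then be joined to the tabular features before computing embeddings.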
/Users/vaishnavinagesh/Desktop/AI-ML_AWG/.venv/lib/python3.12/site-packages/sklearn/utils/extmath.py:337: RuntimeWarning:
divide by zero encountered in matmul
[... the same divide-by-zero / overflow / invalid-value RuntimeWarnings repeat for extmath.py lines 337, 338, 342, 529, and 543 ...]